ABSTRACT 

The fundamental building block of social influence is for one 
person to elicit a response in another. Researchers measur- 
ing a "response" in social media typically depend either on 
detailed models of human behavior or on platform-specific 
cues such as re-tweets, hash tags, URLs, or mentions. Most 
content on social networks is difficult to model because the 
modes and motivation of human expression are diverse and 
incompletely understood. We introduce content transfer, an 
information-theoretic measure with a predictive interpreta- 
tion that directly quantifies the strength of the effect of one 
user's content on another's in a model-free way. Estimating 
this measure is made possible by combining recent advances 
in non-parametric entropy estimation with increasingly so- 
phisticated tools for content representation. We demonstrate 
on Twitter data collected for thousands of users that con- 
tent transfer is able to capture non-trivial, predictive rela- 
tionships even for pairs of users not linked in the follower or 
mention graph. We suggest that this measure makes large 
quantities of previously under-utilized social media content 
accessible to rigorous statistical causal analysis. 

Categories and Subject Descriptors 

H.1.1 [Systems and Information Theory]: Information
Theory; H.3.4 [Systems and Software]: Information networks;
J.4 [Social and Behavioral Sciences]: Sociology

1. INTRODUCTION

While the emergence of various online social networking 
platforms provides a steady source of data for researchers, 
it also provides a source of constantly evolving complexity. 
Most prior research has focused on analyzing various static 
topological properties of networks induced by social commu- 
nication, while discarding the content of communication. At 
the same time, there is a growing recognition that a more
nuanced understanding of social interactions requires analyzing
the semantic content of communications. For instance,
it has been suggested that linguistic cues in communicative 
patterns, as well as the ways individuals echo and accommo- 
date each other's linguistic styles, can be indicative of rela- 
tive social status of participants [6]. Despite recent progress, 
however, content-based analysis of social interactions is still 
a challenging problem due to the lack of adequate quantita- 
tive methods for extracting useful signals from unstructured 
text. Another significant hurdle is that the design and us- 
age of social networks, and thus the interpretation of various 
signals, are changing over time. 

Here we suggest a novel, information-theoretic approach for 
leveraging user-generated content to characterize interac- 
tions among social media participants. Specifically, given all 
the content generated by a set of users (e.g., a sequence of 
tweets), our goal is to find meaningful edges that indicate 
social interactions among this set of users. Our approach is 
model-free in the sense that it does not presuppose a particu- 
lar behavioral model of users and their interactions. Instead, 
we view users as producers of some arbitrarily encoded in- 
formation stream. If Y's stream has an effect on X's, then 
access to Y's signal can, in principle, improve our prediction 
of X's future activity. This is what we mean by a predictive 
link. We show that this general notion of predictability can 
be used to identify social influence. 

The technical approach proposed here consists of two main 
ingredients (see Fig. 1). First, we represent user-generated
content in a high-dimensional space so that a sequence of 
user-generated posts is mapped to a time-series in this space. 
Second, we apply information-theoretic measures to those 
time series to discover and quantify directed influence among 
the users. Because our method is based on information- 
theoretic principles, it is easy to interpret, applicable to ar- 
bitrary signals and/or platforms, and flexible with respect 
to the representation of content. 

Our approach ultimately reduces to calculating an
information-theoretic measure called transfer entropy between
pairs of stochastic processes [32]. Intuitively, transfer
entropy between processes X and Y quantifies how much
better we are able to predict the target process X if we use
the histories of both Y and X rather than the history of
X alone. By using transfer entropy as a statistical measure
of the relationship between the content of Y's tweets and the
content of X's subsequent tweets, we construct a graph of
predictive links, based only on the content of users' tweets.
Our results demonstrate that transfer entropy indeed reveals
a variety of predictive, causal behaviors. Surprisingly, we
also discover that many of the most predictive links are not
present in the social network, either through mentions or
through friend links. Nevertheless, in Sec. 4.4 we verify the
meaningfulness of our measure by showing that predictive links
are a statistically significant predictor of mentions on Twitter.

To summarize, our main contribution is a novel application
of an information-theoretic framework to content-based social
network analysis, providing a general, flexible measure
of meaningful relationships in the network. This construction
is made possible by two apparently novel technical insights.
(1) Current state-of-the-art methods for estimating
entropic measures such as mutual information continue to
perform well in high-dimensional spaces as long as the data
are effectively low-dimensional in some sense. (2) While content
representations of user activity are high-dimensional, they
are effectively low-dimensional in the required sense. Taken
together, these two points allow us to successfully apply
entropic estimators in a previously inaccessible regime.

After motivating our technical approach and defining the
relevant information-theoretic quantities in Sec. 2, we
describe how to estimate those quantities in Sec. 3 and
demonstrate their use on real-world data from Twitter in Sec. 4.
Finally, we give an overview of related work in Sec. 5,
followed by a discussion of results in Sec. 6.

2. TECHNICAL APPROACH
2.1 Motivation

Let us consider a set of users that generate a time-stamped
sequence of text documents, e.g., tweets. Let X and Y be two
such users. With a slight abuse of notation, let $X^P$ and $Y^P$
denote the content of tweets generated by those users up to
some time, in some representation. While our approach is
not limited to a particular representation, below we will use
$X^P$ and $Y^P$ to describe a topical representation of the content
(e.g., obtained via Latent Dirichlet Allocation, or LDA).

Consider now the problem of predicting the content of the
next tweet generated by X, denoted by $X^F$; see Fig. 1.
Generally speaking, $X^F$ is a random variable that can depend on
a large number of factors that might not be directly observable:
topical interests of user X (and her friends), exogenous
events, and so on. Here, however, we are interested in the
extent to which $X^F$ is influenced by the past tweets $X^P$
and $Y^P$. Namely, we would like to see how much knowing
the past content generated by user Y, $Y^P$, helps us to better
predict $X^F$. If knowing Y's past tweets helps us to predict
$X^F$ more accurately, then we can say that Y exerts a certain
influence on X.

Figure 1: Is the content of X's future tweet, $X^F$,
predictable from past tweets, $Y^P$, $X^P$? We answer
this question by first transforming the text of tweets
into vectors. Joint samples of these variables can be
used to estimate information transfer, or transfer
entropy, quantifying how predictive Y's tweets are
for X's future tweets.


The notion of influence (or causality) described above is
taken in the sense of Granger causality [10], which demands
that (1) the cause occurs before the effect; (2) the cause
contains information about the effect that is unique, and is in
no other variable [12]. In practice, determining that information
is "in no other variable" is difficult. For determining
a causal effect on a user in a social network, we only attempt
to rule out the user's recent past as an explanation.
Exogenous and long-term effects are difficult to account for
but will be discussed in some interesting cases. The principle
behind Granger causality was originally applied in the
context of regression models, but applying these ideas in
the context of information theory leads to effective tests of
causality [12].

2.2 Transfer Entropy 

We denote by H(X) the entropy of a random variable,
X, with some associated probability distribution,
$p(x) = \Pr(X = x)$, for $x \in \mathbb{R}^{d_x}$. In this case
(differential) entropy is defined in the standard way, using the
natural log,

$$H(X) = E[-\log p(x)] = -\int dx \, p(x) \log p(x).$$

We sometimes speak of entropy as quantifying our "uncertainty"
about X. Standard higher order entropies such as
mutual information and conditional entropy can be defined
in terms of differential entropy as
$H(X : Y) = H(X) + H(Y) - H(X, Y)$ and
$H(X \mid Y) = H(X, Y) - H(Y)$, respectively. Mutual information
can be interpreted as the reduction of uncertainty in X from
knowing Y.

Transfer entropy, or information transfer [32], can be defined
as

$$\begin{aligned}
IT_{Y \to X} &= H(X^F : Y^P \mid X^P) \\
             &= H(X^F \mid X^P) - H(X^F \mid Y^P, X^P),
\end{aligned} \qquad (1)$$

where $X^F$ is interpreted as information about user X's future
behavior, and $X^P$, $Y^P$ as user X's and Y's past behavior,
respectively. The temporal indices dictate that cause should
come before effect, and conditioning on X's past ensures that
any explanatory value from Y is not already present in X's
past behavior. The first line writes this quantity succinctly
as a conditional mutual information while the second line
has the nice interpretation that we are interested in how
much knowing $Y^P$ reduces our uncertainty about $X^F$. This
quantity is asymmetric, so in principle $IT_{Y \to X} \neq IT_{X \to Y}$.
We will see examples where this is the case.
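
As a worked step, the second line of Eq. 1 can be expanded with the identity H(X|Y) = H(X, Y) - H(Y) given above. This is a standard identity rather than something stated by the authors; it is shown only to make explicit which joint entropies enter the measure.

```latex
\begin{aligned}
IT_{Y \to X} &= H(X^F \mid X^P) - H(X^F \mid Y^P, X^P) \\
             &= \left[ H(X^F, X^P) - H(X^P) \right]
              - \left[ H(X^F, Y^P, X^P) - H(Y^P, X^P) \right].
\end{aligned}
```

Each of the four terms is an entropy of a (possibly joint) vector-valued variable, so the quantity vanishes whenever $Y^P$ adds nothing to the prediction of $X^F$ beyond $X^P$.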

In this paper, we take $Y^P$, $X^P$, $X^F$ to be random processes
representing the content of tweets for users X and Y, and so
we refer to this measure as content transfer. In particular,
referring to Fig. 1, given some concrete procedure to turn an
individual tweet into a vector, x, we consider samples,
$i = 1, \ldots, N$, of triples of tweets
$(x^{F,i}, y^{P,i}, x^{P,i})$ representing X's
tweet, Y's most recent previous tweet, and X's most recent
previous tweet. Note that we demand that Y's tweet should
occur after X's previous tweet; otherwise the causal effect of
Y's tweet is already being taken into account as affecting X's
previous tweet. Also, Y could tweet many times in between
X's tweets, but we only consider the most recent tweet for
simplicity. Many possibilities exist to represent a tweet as a
vector and in Sec. 4.2 we describe some of them along with
the topic model approach used in this paper.
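
The triple construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tweet_triples`, the `(timestamp, vector)` input format, and the use of strict inequalities for equal timestamps are all assumptions.

```python
def tweet_triples(x_tweets, y_tweets):
    """Build (x_future, y_past, x_past) triples.

    x_tweets, y_tweets: lists of (timestamp, vector), sorted by timestamp.
    For each pair of consecutive tweets by X, keep Y's most recent tweet
    that falls strictly between them; skip the pair if Y did not tweet.
    """
    triples = []
    for (t_prev, x_prev), (t_next, x_next) in zip(x_tweets, x_tweets[1:]):
        # Y's tweets after X's previous tweet and before X's next tweet.
        between = [vec for (t, vec) in y_tweets if t_prev < t < t_next]
        if between:
            # Only the most recent of Y's tweets is used, for simplicity.
            triples.append((x_next, between[-1], x_prev))
    return triples
```

For example, if X tweets at times 1, 5, and 9 and Y tweets at times 2, 3, and 10, only one triple results: X's tweet at time 5, Y's tweet at time 3, and X's tweet at time 1.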

Note that we could drop the conditioning on $X^P$ to get
the mutual information between X's future and Y's past,
sometimes called time-delayed or time-shifted mutual information.
We will compare this simpler quantity to transfer
entropy below. On the other hand, in principle, we could add
even more conditioning on other variables like news stories,
other users' activity, $Z^P$, or more history for users X and Y.
The difficulty is in estimating these entropies from sparse
data, and adding more conditions also increases the
dimensionality of the problem.

3. ESTIMATING TRANSFER ENTROPY 

Generally speaking, calculating entropic measures for
high-dimensional random variables is problematic due to data
sparsity [24]. Rather than binning data and estimating
probability distributions as a prerequisite for calculating entropy,
Kozachenko and Leonenko [17] introduced an entropy estimator
that was asymptotically unbiased and did not require
binning of data. Binless estimators were extended to higher
order quantities like mutual information [18], and divergence
between two distributions [37]. Below we make use of a
generalization of the approach that allows binless estimation of
the more nuanced transfer entropy [33].

The basic idea behind non-parametric binless entropy esti- 
mators is to average local contributions to the entropy in 
the neighborhood of each point, where the neighborhood 
size is chosen adaptively according to the point's k near- 
est neighbors. The neighborhood shrinks as we add more 
data, improving the estimate. The fundamental strength of 
this approach comes from the fact that it is easier to locally 
estimate entropy than to locally estimate a probability den- 
sity; mathematically, the equivalent local density estimators 
are not consistent while the entropy estimators are consistent
[26]. For fixed k, various entropy estimators have been
shown to be asymptotically unbiased and consistent under
only mild assumptions [37, 26, 27].

Generally, suppose we have samples $(x^{(i)}, y^{(i)})$,
$i = 1, \ldots, N$. For each point, i, we construct the random
variable $\epsilon_k(i)$, representing the distance to the k-th
nearest neighbor in the joint x-y space according to some metric.
We will use the maximum norm in all dimensions following previous
work [18, 33]. For instance, the distance between points i and j
in the joint space would be

$$\max\left\{ |x^{(i)}_1 - x^{(j)}_1|, \ldots, |x^{(i)}_{d_x} - x^{(j)}_{d_x}|, |y^{(i)}_1 - y^{(j)}_1|, \ldots, |y^{(i)}_{d_y} - y^{(j)}_{d_y}| \right\},$$

where $d_x$ and $d_y$ are the dimensions of the x and y spaces,
respectively. If we project only onto the x (or y) subspace, the
number of points strictly within a distance $\epsilon_k(i)$ is
defined as $n_x(i)$ (or $n_y(i)$). We can now proceed to write
down the Kraskov mutual information estimator [18],

$$H(X : Y) = \psi(k) + \psi(N) - \frac{1}{N} \sum_{i=1}^{N} \left( \psi(n_x(i) + 1) + \psi(n_y(i) + 1) \right). \qquad (2)$$

Here, $\psi$ is the digamma function. Note that this simple
expression depends only on distances between samples, and
does not depend on the dimension of the space.

The Kraskov estimator has been extended to conditional
mutual information (CMI) [33]. Now we add a third covarying
vector, z, and define $\epsilon_k(i)$ as the distance to the k-th
nearest neighbor in the full joint x-y-z space, while $n_{yz}(i)$,
for instance, represents the number of points strictly within
a distance $\epsilon_k(i)$ projecting onto the y-z subspace.

$$H(X : Y \mid Z) = \psi(k) + \frac{1}{N} \sum_{i=1}^{N} \left( \psi(n_z(i) + 1) - \psi(n_{xz}(i) + 1) - \psi(n_{yz}(i) + 1) \right) \qquad (3)$$

These two estimators can be interpreted as estimators of
time-delayed mutual information and transfer entropy,
respectively, through appropriate choice of X, Y, Z.

Note that because transfer entropy (and mutual information)
estimators are an average over all samples, we can
easily determine the contribution of one sample to the
estimated entropy.

$$H^{(i)}(X : Y \mid Z) = \psi(k) + \psi(n_z(i) + 1) - \psi(n_{xz}(i) + 1) - \psi(n_{yz}(i) + 1) \qquad (4)$$

We refer to this quantity as local transfer entropy and note
its similarity to a previously introduced measure. In
principle, not only can we identify a pair of users, X, Y,
such that Y has high content transfer towards X, we can also
order their tweet exchanges to see which ones contribute
most to that assessment. An example is shown in Table 1 in
Sec. 4.3.
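
The nearest-neighbor CMI estimate and its per-sample (local) terms can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a brute-force $O(N^2)$ neighbor search, a hand-written digamma approximation, and hypothetical function names (`digamma`, `local_cmi`, `cmi`). The signs are arranged in the standard way so that the estimate is near zero when X and Y are conditionally independent given Z.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def max_norm(a, b):
    """Maximum-norm distance between two equal-length vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


def count_within(points, i, eps):
    """Number of points other than i strictly within eps of point i."""
    return sum(1 for j, p in enumerate(points)
               if j != i and max_norm(points[i], p) < eps)


def local_cmi(xs, ys, zs, k=3):
    """Per-sample contributions to the estimate of H(X : Y | Z), in nats.

    xs, ys, zs: equal-length sequences of vectors (lists of floats).
    """
    xz = [list(x) + list(z) for x, z in zip(xs, zs)]
    yz = [list(y) + list(z) for y, z in zip(ys, zs)]
    zz = [list(z) for z in zs]
    joint = [list(x) + list(y) + list(z) for x, y, z in zip(xs, ys, zs)]
    local = []
    for i in range(len(joint)):
        # Distance to the k-th nearest neighbor in the joint x-y-z space.
        dists = sorted(max_norm(joint[i], p)
                       for j, p in enumerate(joint) if j != i)
        eps = dists[k - 1]
        local.append(digamma(k)
                     + digamma(count_within(zz, i, eps) + 1)
                     - digamma(count_within(xz, i, eps) + 1)
                     - digamma(count_within(yz, i, eps) + 1))
    return local


def cmi(xs, ys, zs, k=3):
    """Average of the local terms: the estimate of H(X : Y | Z)."""
    local = local_cmi(xs, ys, zs, k)
    return sum(local) / len(local)
```

Taking X as $x^F$, Y as $y^P$, and Z as $x^P$ gives a content transfer estimate, and sorting the output of `local_cmi` ranks individual tweet exchanges by their contribution.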



3.1 Empirical study of CMI estimators 

Although entropy estimators have many nice theoretical 
properties in the asymptotic limit, for finite sample sizes we 
must ultimately rely on empirical results. Many papers have 
reported impressive empirical results from these entropy
estimators already [35, 37, 12, 18], so we will explore only one
unusual feature of our problem, with surprising results.

Note that the estimators in Eqs. 2 and 3 do not explicitly
rely on the dimension. In fact, they only rely on the vectors
themselves through a distance function. Therefore, adding
extra dimensions to a vector that are constant will have
no effect on the distance function or the estimator. I.e., we
would be transforming the vectors $x^{(i)}$,

$$x^{(i)} \to x^{(i)\prime} = (x^{(i)}, c_1, c_2, \ldots),$$

where the $c_j$ are arbitrary constants (that is, they are the
same for each point i). Clearly, the distances are unchanged,

$$\|x^{(i)} - x^{(j)}\| = \|x^{(i)\prime} - x^{(j)\prime}\|$$

(similarly in the joint x-y-z space), and this is all that is
relevant for the estimators in Eqs. 2 and 3. The question we
explore in Fig. 2 is the effect of adding extra dimensions which
are only nearly constant. The intuition for exploring this
scenario is that vectors that represent content should be high
dimensional, but we expect most individual users to participate
in only a small subset of the full content space (we make this
intuition concrete in Fig. 3 in Sec. 4.2). Can we expect entropy
estimators to work in this case?

Figure 2: (a) Demonstrating convergence for the estimator
in Eq. 3 (circles) when the true CMI (dotted
line) is known. This estimator succeeds even in the
presence of many extraneous dimensions with small
noise (diamonds). When labels are permuted, the
estimator converges to zero (squares). (b) Comparing
transfer entropy for pairs of interacting Twitter
users (circles, continued in inset, and triangles) to
null hypotheses (diamonds and squares). All points
include 95% confidence intervals; see Sec. 3.1 for details.
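
The invariance of the maximum-norm distance under constant padding can be checked directly. The snippet below is a small illustration with hypothetical helper names (`max_norm`, `pad`); it is not taken from the paper's code.

```python
def max_norm(a, b):
    """Maximum-norm distance between two equal-length vectors."""
    return max(abs(u - v) for u, v in zip(a, b))


def pad(vector, extra):
    """Return a copy of vector with extra coordinates appended."""
    return list(vector) + list(extra)


a = [0.1, 0.9]
b = [0.4, 0.2]
constants = [3.0, -7.5]  # the same extra coordinates for every point

# Identical padding contributes |c - c| = 0 to every coordinate difference.
same = max_norm(pad(a, constants), pad(b, constants)) == max_norm(a, b)

# Nearly constant padding: the extra coordinate differs by about 0.02, which
# is smaller than the original distance, so the maximum is again unchanged.
nearly = max_norm(pad(a, [3.01]), pad(b, [2.99])) == max_norm(a, b)
```

The second check hints at why nearly constant dimensions can be harmless: they change a distance only when their fluctuation exceeds the separation already present in the informative coordinates.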

We start by considering a situation in which we can calculate
conditional mutual information (CMI) analytically. As an
example, we consider a Gaussian where $x, y, z \in \mathbb{R}^1$, and we
take X and Y to be strongly correlated, while Y, Z and X, Z
are weakly correlated. For illustration purposes, we consider
the following covariance matrix.




In this case, H(X : Y|Z) = 0.357, while H(X : Y) = 0.4130.
Because Z is correlated with Y, conditioning on it reduces
some of Y's usefulness for predicting X.^1 In Fig. 2(a) we
attempt to estimate this entropy from random samples. We
show the mean of the estimator over many trials (circles),
along with the 95% confidence intervals. For comparison,
we apply the estimator to samples where the sample indices
of the y and z components are randomly permuted and it
converges quickly to zero (squares).

1 Both mutual information and conditional mutual information
must be non-negative. For our purposes, conditioning on
another variable generally reduces the mutual information,
but it is possible for conditional mutual information to be
larger than the associated mutual information.

What happens if we now take $x, y, z \in \mathbb{R}^{150}$? We will let
$x_1, y_1, z_1$ be drawn from the same low-dimensional distribution
above, but all the other components of the vector
will be uncorrelated Gaussian noise with standard deviation
0.05. The MI and CMI should be unchanged. Now we are
estimating CMI in a 450 dimensional space with fewer than
400 samples. Surprisingly, the estimator still works well, only
slightly underestimating the true CMI. If we had increased
the standard deviation of the noise, eventually the signal
would have been lost and the estimator would converge to
0.

In Fig. 2(b) we look at the convergence of the estimator
for examples from pairs of users on Twitter (details in
Sec. 2). First, we consider a very strong signal corresponding
to the edge kar → spo discussed later (circles). The estimate
increases relatively quickly so that it must be continued in
the inset. We consider two null hypotheses as comparisons.
First, we permute the order of tweets for kar, spo and
calculate content transfer (squares). Second, we construct two
Twitter streams from random tweets in our dataset, and we
estimate the content transfer (diamonds). Finally, we calculate
content transfer for the user pair no → li from Sec. 4.4,
which represents more social behavior (triangles). Note that
the estimator in this case is quite noisy, and we do not expect
perfect discrimination of the signal with so little data.

3.2 Implementation details

There are several details to be considered before implementing
the estimators above. First of all, we are required to find
the k-nearest neighbors to each point, but how should we
choose k? Smaller k reduces the bias, but larger k reduces
the variance [31]. We find the results are not very sensitive to
k. We use k = 3 as suggested by Kraskov et al. [18] and this
choice is confirmed by numerical results shown in the inset
of Fig. 7. To avoid situations where two points are exactly
the same distance away, we also add low intensity ($10^{-10}$)
noise to the data [18].

The most intensive part of the calculation is the search for
nearest neighbors. In high dimensions, as is the case in this
paper, this cannot be sped up much beyond $O(N^2)$ for N
samples. However, we can fix some constant $N_c$, and only
make estimates using samples of this fixed size. Besides
bounding the computational complexity, if we average over
multiple samples we can also reduce the variance. For N
samples of tweet exchanges, we take $2\lceil N/N_c \rceil$ random
subsets of size $N_c$. We set $N_c = 100$ (which will be the minimum
sample size we keep in our data, discussed in Sec. 4). This
also ensures that any bias from finite sample size will affect
all edges equally. Because we attempt to evaluate transfer
entropy between all of the millions of user pairs (only some
of which have the minimum number of samples), we had to
split our calculation over many processors.^2
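
The subsampling scheme described in this section can be sketched as follows. This is a hedged illustration rather than the authors' code: `subsampled_estimate` and `jitter` are hypothetical names, the estimator is passed in as any callable on a list of samples, each subset is drawn without replacement, and Gaussian noise is an assumed choice because the text does not specify the noise distribution.

```python
import math
import random


def jitter(vector, scale=1e-10, rng=None):
    """Add low-intensity noise to break exact ties between distances."""
    if rng is None:
        rng = random.Random()
    return [v + scale * rng.gauss(0.0, 1.0) for v in vector]


def subsampled_estimate(samples, estimator, n_c=100, rng=None):
    """Average an estimator over 2 * ceil(N / n_c) random subsets of size n_c.

    samples:   list of N samples (e.g., tweet triples).
    estimator: callable mapping a list of n_c samples to a float.
    """
    if rng is None:
        rng = random.Random()
    n = len(samples)
    if n < n_c:
        raise ValueError("fewer samples than the fixed subset size")
    n_subsets = 2 * math.ceil(n / n_c)
    estimates = [estimator(rng.sample(samples, n_c))
                 for _ in range(n_subsets)]
    return sum(estimates) / n_subsets
```

Because every estimate is made on exactly `n_c` samples, the cost per user pair is bounded and any finite-sample bias is shared across edges, which is the motivation given in the text.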

4. RESULTS 

We will apply the entropy estimators to real world data from 
Twitter. After describing the dataset, we will discuss options 
for representing tweet text as vectors. In Sec. 4.3 we will
examine the directional links with the highest content transfer
on the entire dataset. Although we lack a ground truth to
validate our results, in Sec. 4.4 we consider activity of a
subset of users for whom mentions can be used to test the 
significance of content transfer. 

4.1 Dataset description 

We make use of data originally collected and described in 
[30]. All tweets were collected for a set of 2400 users over a
one month period from 9/20/2010-10/20/2010. The set of 
users was picked by starting with a small, random initial 
set and constructing a snowball sample using mentions and 
re-tweets. The dataset was constrained to users who self- 
reported in their profile a location in the Middle East. The 
dataset also contained sampled tweets from tens of thou- 
sands of other users who mentioned users in the original set. 
We used those tweets to help train the topic model (giving 
a total of over half a million tweets after preprocessing de- 
scribed in the next section), but we did not consider those 
users when calculating content transfer for pairs of users. 
After eliminating users with fewer than 100 tweets, we
considered all possible directed edges among the remaining 770
users. Note that not all ordered pairs of users had at least
100 tweet triples as defined in Sec. 2.2, in which case content
transfer was not calculated.

4.2 Topic vector representation 

A crucial ingredient in our attempt to apply information- 
theoretic measures to social media is a way to represent 
content as vectors. Luckily, a great deal of work has been
done on mathematical representations of content; a few
examples are discussed in Sec. 5. The richness of our results
is ultimately limited by the quality of content representation.
On the other hand, higher dimensional representations
make entropy estimation more difficult.

Compounding this difficulty, social media presents some 
unique challenges. For instance, on Twitter, messages are 
very short (140 characters), providing little context to
determine what a tweet is about. The use of "netspeak",
emoticons, and abbreviations challenges traditional models
of communication. Many languages are represented, sometimes
mixed within a single tweet. Spelling mistakes and
URLs multiply the number of unique tokens. On the other 
hand, the sheer volume of data provides an advantage that 
outweighs these difficulties. 

Ultimately, we chose to use topic models to represent tweets
because of convenient off-the-shelf implementations [28], and
a growing body of work exploring their applicability to
Twitter [39, 14]. Our purpose in using a topic model differs from
standard aims. In particular, our ultimate goal is not to find
distinct topics with clear interpretations, but to find the
minimal representation that preserves relevant detail.

For our experiments, we trained an LDA topic model
implemented in gensim [28]. For pre-processing the text, we
followed most of the prescriptions in [14]. (1) We replaced all
URLs with the word "[url]". (2) We replaced all words
starting with "@" with the word "[mention]". (3) We removed all
non-Latin alphabet characters and converted to lower-case. (4)
We removed a standard list of English stop-words. Because
we will use mentions later in our validation, step (2) is
particularly important to ensure that our topic model has not
learned name associations. We also removed all tweets that
begin "RT @" (re-tweets), since this type of information
diffusion has been well-studied, as discussed in Sec. 5.
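
The pre-processing steps above can be sketched with the standard library. This is an illustration under assumptions, not the pipeline actually used: the function name `preprocess`, the URL pattern, the whitespace tokenization, the treatment of digits and punctuation as removable characters, and the tiny stop-word list are all placeholders.

```python
import re

# A tiny illustrative stop-word list; a standard English list would be used.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to",
              "with"}

URL_RE = re.compile(r"https?://\S+")


def preprocess(tweet):
    """Return a list of tokens, or None for re-tweets (which are dropped)."""
    if tweet.startswith("RT @"):
        return None
    tokens = []
    for word in tweet.split():
        if URL_RE.match(word):
            tokens.append("[url]")        # step (1)
        elif word.startswith("@"):
            tokens.append("[mention]")    # step (2)
        else:
            # Step (3): lower-case and keep only the letters a-z.
            cleaned = re.sub(r"[^a-z]", "", word.lower())
            # Step (4): drop stop-words (and words with no letters left).
            if cleaned and cleaned not in STOP_WORDS:
                tokens.append(cleaned)
    return tokens
```

Replacing mentions before any model is trained keeps account names out of the vocabulary, which matters here because mentions are later used for validation.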




[Plot omitted; axis label: # Active Topic Dimensions]

Figure 3: The effective dimension of topic vectors 
for most users is low. We consider all users with 
at least 100 tweets. For each component of the 150 
dimensional topic vector, we calculate the standard 
deviation over all tweets from one user. If the stan- 
dard deviation is greater than 0.05, we say that topic 
dimension is "active" for that user. 

Next, we constructed a bag-of-words vector representation
with the remaining words in the dataset that appeared more
than once. Each component of the vector represents the
frequency with which one of these words occurred in a given
tweet. These vectors were transformed using the TF-IDF
score.^3 Finally, we used the TF-IDF vectors to learn an LDA
topic model. We tried topic models with 10, 50, 100, 125,
150, 175, and 200 topics. At first, we assumed that the lower
dimensional topic models, while being worse representations
of the text, would be more amenable to entropy estimators.
However, larger topic models actually fared much better,
despite the high dimensionality of the vectors. One reason for
this surprising result is that the effective dimensionality of
most Twitter users is far smaller than the dimensionality of
the topic vector. To verify this, we considered all the users
in our dataset with at least 100 tweets. For each component
of the topic vector, we calculated the standard deviation
over all tweets for one user. We define the number of "active
topics" as the number of components for which the standard
deviation was over 0.05. Although this notion does not conform
to a standard intuition about what should be considered an
"active topic", it does describe what is relevant for entropy
estimation, as discussed in Sec. 3.1. The result is shown in
Fig. 3 for a topic model with 150 topics. The dynamics of most
users are constrained to a handful of dimensions. The fact that
the active topics may differ for different users is irrelevant
for the purpose of entropy estimation.

2 All results were obtained on USC's HPCC [1]. Code is
available: http://www.isi.edu/~gregv/te.py

3 Term frequency-inverse document frequency: we use the
standard definition. If a term occurs f times it is transformed
to $f \log_2(D/d)$, where D represents the number of documents
and d represents the number of documents containing the
term.
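
Both the TF-IDF weight from the footnote and the "active topic" count can be sketched in a few lines. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text says only "standard deviation".

```python
import math
import statistics


def tfidf_weight(f, n_docs, n_docs_with_term):
    """Weight f * log2(D / d) for a term occurring f times in a document."""
    return f * math.log2(n_docs / n_docs_with_term)


def active_dimensions(topic_vectors, threshold=0.05):
    """Count topic components whose standard deviation exceeds threshold.

    topic_vectors: one user's tweets as equal-length lists of topic weights.
    """
    n_dims = len(topic_vectors[0])
    count = 0
    for d in range(n_dims):
        values = [vec[d] for vec in topic_vectors]
        if statistics.pstdev(values) > threshold:
            count += 1
    return count
```

For a user whose tweets vary in only a handful of components, `active_dimensions` returns a small number even when the topic vector has 150 entries, which is the sense of "effective dimension" used in Fig. 3.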

4.3 Full graph 

We begin by calculating content transfer for all ordered pairs
of users with sufficient samples. Unless otherwise specified,
we use $n_{topics} = 125$ in the following examples, although the
high content transfer edges were insensitive to this choice.
Looking at the histogram of content transfer for all edges
in Fig. 4, we note that there are a few obvious outliers. We
also show the network consisting of only these high content
transfer edges, with account names abbreviated. Inspection
of the tweets reveals that these links are all strongly
predictive; we proceed to give several examples.

Figure 4: Histogram of content transfer for all ordered
pairs of users with sufficient samples. Edges
with content transfer to the right of the dotted line
are shown in the inset graph. Many of these edges
are not present in the follower graph or mention
graph.



We start with the edge mza → zah with sample tweets
shown in Table 1 and point out some things notably lacking.
For these two users there are no friend links, no mentions,
no retweets, no matching URLs (though the shortened
URLs point to the same stories), and no matching hash tags.
Nevertheless, even though the text is altered, content transfer
recognizes that zah's tweets are identical in content to
mza's tweets. We use the local transfer entropy from Eq. 4
to order the 288 tweet exchanges. We also read through all
the tweet exchanges and hand labeled 228 instances in which
the two users' tweets clearly referred to the same story. The
probability that the local transfer entropy was higher for
a tweet exchange which was a duplicate than for a
non-duplicate one was 0.68. We also note the asymmetry of this
edge: the content transfer from mza → zah is 0.24 while
in the other direction it is only 0.01. This asymmetry is often
taken to suggest a causal connection [33]. Nevertheless,
there remains the possibility of an external, mutual cause.
The impossibility of ruling out such alternatives is one reason
we emphasize the interpretation of content transfer as
a measure of predictability. Of course, this is a well-known
caveat regarding Granger causality. A simple explanation in
this case is that both accounts simply read and post from
the same news site. However, in that case we would expect
the order of tweets to sometimes be reversed, causing the
transfer entropy to be more symmetric. In fact, the order
of tweets is always preserved. A more nuanced alternative
is that one of the users is temporally "closer" to the news
source. E.g., a service like "twitterfeed.com" can automatically
post news stories to your Twitter account the instant
they are published.
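
One way to compute a statistic like the 0.68 quoted above is to compare every duplicate exchange with every non-duplicate exchange and count how often the duplicate has the higher local transfer entropy. The sketch below assumes this pairwise definition and counts ties as one half; neither detail is stated in the text, and `prob_higher` is a hypothetical name.

```python
def prob_higher(duplicate_scores, other_scores):
    """Fraction of (duplicate, non-duplicate) pairs where the duplicate wins.

    Ties contribute one half, so 0.5 corresponds to no discrimination.
    """
    wins = 0.0
    for d in duplicate_scores:
        for o in other_scores:
            if d > o:
                wins += 1.0
            elif d == o:
                wins += 0.5
    return wins / (len(duplicate_scores) * len(other_scores))
```

With local scores `[0.9, 0.4]` for duplicates and `[0.5, 0.1]` for non-duplicates, three of the four pairs favor the duplicate.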

As opposed to the previous example, the kar cluster contains 
some bi-directed edges. In this case, the users are all following
each other; however, once again, no retweets or mentions
are used. The tweets revolve around sports and some samples
are shown in Table 2. The tweets are clearly all copies of
each other. Confirming the previous intuition about tempo- 
ral ordering, the bi-directed edges are duplicates that occur 
in arbitrary order, while the directed edge away from kar 
is reflected by the fact that all posts appear first on that 
account. The account hus appears to be a personal account 
that occasionally included unrelated tweets. The profile of 
that account describes the author as a "sports analyst." 

The remaining clusters in Fig. 4 have similar easy qualitative
interpretations. The isr cluster users post identical Israeli 
news stories. The remaining edges of the largest cluster also 
revolve around news, mostly of stories in the Middle East. 
The profile of gee lists itself as the Twitter stream of a tech 
news site, while the profile of she lists itself as the founder of 
the same website. Tweets are copied in arbitrary order, lead- 
ing to symmetric content transfer. The edge aad o fri has 
a similar interpretation, with aad's profile declaring himself 
a radio presenter for the internet radio station represented 
by account fri. Again, no mention or follower edges are de- 
clared. 
Figure 5: Illustration of a scenario when time-delayed
mutual information is high while transfer entropy is low.
Colors represent repetition of a particular tweet.

We also calculated the time-delayed mutual information for 
all pairs of users. In general, this quantity was correlated 
with the content transfer (see Fig. Ffb. However, there were 
several examples where mutual information was high while 
content transfer was low. The intuition behind this phenomenon
is expressed in Fig. 5. To use a concrete example from
the dataset, we have user Y repeatedly tweeting the same 
message #shoutout for Floods in #Pakistan #pkfloods 
[url] , which we can imagine as a red line in the picture. 
At some point, user Y switches to repeatedly tweeting 
a new message #TEAMF0LL0W 100 FREE MORE TWITTER 
FOLLOWERS ! [url] (blue lines; see footnote 4). Coincidentally, at nearly
the same time user X switches from repeatedly tweeting In 
My View: HOW TO EXPLODE RESOURCES TO EARN FOREIGN 
EXCHANGE [url] (green lines) to In My View: PAKISTAN 



4 This tweet refers to increasingly popular services to inflate 
the number of followers for an account. Basically, you agree 
to either pay, follow people in the service, or tweet
advertisements about the service in exchange for followers. This
service formed a prominent signal in [34].



ON THE SHOULDERS OF N.R.O BENEFICIARIES [url] (grey lines;
see footnote 5). From looking at the figure, you can see that a red
line for user Y is quite predictive of a subsequent green
line for user X, and this is reflected in the high mutual 
information. On the other hand, if you condition on user 
X's past, you can easily predict that a green line is most 
likely followed by another green line, and seeing a red line 
from user Y does not improve that prediction. Therefore, 
transfer entropy is low in this scenario. 
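
The scenario above can be reproduced with a toy calculation. The following is a minimal sketch, assuming discrete color labels in place of tweet content, a single step of history, and simple plug-in entropy estimates rather than the nearest-neighbor estimator used in our experiments; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Plug-in Shannon entropy (bits) of a list of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())

def delayed_mutual_information(x, y):
    """I(X_future ; Y_past) with a one-step delay."""
    xf, yp = x[1:], y[:-1]
    return entropy(xf) + entropy(yp) - entropy(list(zip(xf, yp)))

def transfer_entropy(x, y):
    """I(X_future ; Y_past | X_past) with one step of history."""
    xf, xp, yp = x[1:], x[:-1], y[:-1]
    return (entropy(list(zip(xf, xp))) + entropy(list(zip(xp, yp)))
            - entropy(xp) - entropy(list(zip(xf, xp, yp))))

# Y repeats "red" and then switches to "blue"; X repeats "green" and then
# switches to "grey" at the same moment.
y = ["red"] * 50 + ["blue"] * 50
x = ["green"] * 50 + ["grey"] * 50

mi = delayed_mutual_information(x, y)  # close to 1 bit
te = transfer_entropy(x, y)            # zero: X's own past already explains X's future
```

In this toy case Y's past is a deterministic function of X's past, so conditioning on X's past removes all of the apparent dependence.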
Table 1: Tweet exchanges between two users with
high content transfer from raza → zah. Examples
were picked which have high and low local transfer
entropy.



E.g., one user mentioned "@justinbieber" over seven hundred
times during the month of our dataset. Instead we want to
find people who use mentions for conversation, which was
the primary focus of Macskassy [30]. To do this, for each
pair of users, we define the number of mutual mentions as the
minimum of the number of times X mentions Y and the
number of times Y mentions X. We selected the 50 pairs
with the highest mutual mentions. For this set of users (38
users total), we can reasonably assume that mentions are a
weak proxy for online influence. The full mention graph for
this set of users is shown in Fig. 6.
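
The mutual-mention count is simple to compute. The following is a minimal sketch, assuming mentions are available as (author, mentioned user) pairs; the function name and input format are hypothetical.

```python
from collections import Counter

def top_mutual_mention_pairs(mentions, n_pairs=50):
    """Rank unordered user pairs by mutual mentions.

    mentions: iterable of (author, mentioned_user) tuples, one per mention.
    Returns a list of ((user_a, user_b), mutual_count), highest first.
    """
    counts = Counter(mentions)
    mutual = {}
    for (a, b), n_ab in counts.items():
        n_ba = counts[(b, a)]  # Counter returns 0 for a missing direction
        if a < b and n_ba > 0:  # visit each unordered pair once
            mutual[(a, b)] = min(n_ab, n_ba)
    ranked = sorted(mutual.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n_pairs]

# A one-sided burst of celebrity mentions contributes nothing,
# while a two-way exchange is kept.
example = ([("x", "celeb")] * 700 + [("x", "y")] * 3 + [("y", "x")] * 5
           + [("a", "b")] * 2 + [("b", "a")] * 1)
pairs = top_mutual_mention_pairs(example)
```

Taking the minimum over the two directions is what filters out one-sided attention seeking.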

For this restricted set of users, we see how well content
transfer corresponds to the underlying mention graph. As an
evaluation metric we use the area under the receiver operating
characteristic curve (AUC) [11]. We rank all edges by
content transfer, from highest to lowest. In principle,
we would like the edges in the mention graph to have the
highest content transfer. The AUC can be interpreted as
the probability that a mention edge (X mentions Y at least
once) has a higher content transfer (IT_{Y→X}) than an edge
without a mention. As a null hypothesis, we can imagine
that content transfer scores are random. In that case the
mean AUC will be 0.5. Since we have 74 mention edges and
785 non-mention edges, the standard error of the AUC under
the null hypothesis is about 3.5% [11]. The best result
for AUC (using n_topics = 125) in Fig. 7(a) is 0.648, which is
over four standard deviations away from the null hypothesis.
As an alternate perspective, we point out that the precision
and recall for the top 100 edges were 20% and 28%, respectively,
which are both about twice the baseline.
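
The significance figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the null standard error is given by the variance of the Mann-Whitney statistic for random scores, which reproduces the quoted 3.5% but is not necessarily the exact expression used in [11]; the function names are hypothetical.

```python
from math import sqrt

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half, matching the pairwise interpretation of the AUC.
    """
    wins = sum((p > q) + 0.5 * (p == q)
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def null_auc_standard_error(n_pos, n_neg):
    """Standard error of the AUC when scores carry no information."""
    return sqrt((n_pos + n_neg + 1) / (12.0 * n_pos * n_neg))

se = null_auc_standard_error(74, 785)  # about 0.035
z = (0.648 - 0.5) / se                 # about 4.2 standard deviations
```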
Table 2: The sports cluster in Fig. 4. User kar always
tweets first, while the others repeat sporadically and
in varied order.



Fig. 7(a) shows the AUC for various topic models using either
transfer entropy or time-delayed mutual information.
Because the results are noisy and, in principle, our method
does not rely on the details of any particular topic model,
in Fig. 7(b) we used the average rank given by multiple
topic models to predict mention edges. On average, transfer
entropy slightly outperforms time-delayed mutual information.
The inset of Fig. 7(b) shows the effect of trying various
k (number of nearest neighbors in our entropy estimator),
with n_topics = 100. We see that k = 3 is a good choice, but
larger values of k give similar results.
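
Rank averaging across topic models can be sketched as follows. This is only an illustration, assuming every topic model scores the same set of edges and that ties are broken by sort order; the function names and the dictionary layout are hypothetical.

```python
def ranks(scores):
    """Map each edge to its rank (1 = highest score) under one topic model."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {edge: r for r, edge in enumerate(ordered, start=1)}

def average_rank(score_tables):
    """Average the per-model ranks of each edge.

    score_tables: list of dicts {edge: content_transfer}, one per topic model.
    A lower average rank means a stronger predicted edge.
    """
    per_model = [ranks(s) for s in score_tables]
    edges = per_model[0].keys()
    return {e: sum(r[e] for r in per_model) / len(per_model) for e in edges}

# Two hypothetical topic models that disagree about the top edge.
model_a = {("u", "v"): 0.9, ("v", "u"): 0.1, ("u", "w"): 0.5}
model_b = {("u", "v"): 0.4, ("v", "u"): 0.2, ("u", "w"): 0.7}
combined = average_rank([model_a, model_b])
```

Averaging ranks rather than raw scores keeps any one topic model's score scale from dominating the combined ordering.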



4.4 High mention users 

We saw that the user pairs with the highest transfer entropy, 
while clearly good predictors of content, did not have the 
typical markers associated with "influence": no mentions, no 
retweets, and no following. In this section we consider a sub- 
set of Twitter users with two goals. (1) Can content transfer 
capture more subtle social behavior? (2) Can we use some of 
the metadata about mentions and followers to validate the 
effectiveness of content transfer? 

A first idea would be to consider the set of users who use 
"@mentions" with the intuition that if a user mentions somebody,
they are responding to them somehow. Unfortunately,
this intuition would often fail. A major use of mentions is 
to attempt to get the attention of celebrity Twitter users. 

5 We also estimated the mutual information H(X^F : X^P),
and this user was an example that had a high score for this
measure. Basically, the user tweets many different story
titles, but repeats each one dozens of times.



In Fig. 6 we consider the full mention graph of the set of users
with high mutual mentions. We also highlight the top 5 edges
according to content transfer for n_topics = 125, since it gave
the best AUC. Four edges are true positives, with example
tweets shown in Table 3. Some comments are in order about
how tweet processing affected these examples. First,
we eliminated messages that started with "RT @", but not
"partial retweets", where a message was prepended to the
retweet. Second, no language detection was done, though
eliminating non-Latin characters removed most of the foreign
text, which was in Arabic and Hebrew. However, a tweet
containing any Latin characters (including mentions) was still
represented. Most of the topics represented English words,
but a few topics contained mostly transliterated Hindi and
Urdu, or Spanish. If one user tweets in a given language and
another user responds in kind, that is a strongly predictive
signal.

The discussion in Table 3 between sh and ta perhaps represents
the ideal for detecting social influence. In that case
we see that if one user broaches a political topic, the other
will follow suit. Note that this pair of users was detected
through content dynamics alone, without reference to mentions
or even knowledge of follow edges. In fact, the distance
between the tweet vectors of sh and ta is never even calculated;
content transfer only looks for a predictable co-variation
in the content of their tweets.

Figure 6: The mention graph for users described
in Sec. 4.4. The five edges with the highest transfer
entropy are highlighted. True positives are thick
red arrows and false positives are blue dashed arrows.
The green, dotted arrows are mention edges
for which content transfer was not calculated due to
insufficient data.



5. RELATED WORK 

Much research has focused on characterizing and identifying
influential users that can facilitate information diffusion
along social links. Researchers have suggested different
characterizations of influentials based on various network
centrality measures. For Twitter data, various
influence measures include the number of followers, mentions,
and retweets [2], the PageRank of the follower network [19], and
the size of the information cascades [2]. More recent work has
attempted to utilize temporal information through the
influence-passivity score [29] and transfer entropy [34]. None of
those measures, however, take content into account. More recently,
several authors have suggested topic-sensitive influence measures
such as TwitterRank [38], which takes into account topical
similarity among the users. Topic-specific re-tweeting behavior
was examined in
More generally, there is an increasing trend to use communication
content for inferring the nature of relationships between
users [3, 7, 9]. An interesting line of research grounded
in the psycholinguistic theory of communication has studied
the convergence of communicative behavior among Twitter
users [5]. In particular, it has been suggested that conversational
behavior can be indicative of relative social status of
Figure 7: AUCs for reconstructing the mention
graph in Sec. 4.4 for various parameters. (a) Transfer
entropy and MI for various topic models, k = 3.
(b) AUC using the average ranking for several topic
models. Inset shows the effect of changing k, for a
topic model with 100 topics.
Tweet

Users: sh, ta, sh, ta
@ta talk to police officers. 6 prominent policemen of Op Cleanup have been killed in last 2 yrs. Still tolerating MQM
@sh I meant the "participation" of the hijacked public was a function of fear perp by Talibs. Same thing here, ppl don't want 2 die
@ta what does it serve them?More pathetic f*tards snatching their mobiles and wallets? Small-crime is engrained in MQM
@sh re: "no soul n honor"... well I think MQM zia's creation to puncture the Sindh Nationalist cause. ISI _will_ slap its b*

Users: fz, fz
@fz oye oye Aj j M J ke barri hai . . tm kal sa lyrics tweet ker rahe ho .. khariat hai na larki ?
@EN Jo b dimagh mai ata hy krdaiti hun m not listening to him atm though
@fz mujhe na urdu na english songs ka lyrics kabhe yadd howa hain ...bhalla bhae tera :P
@EN Lol i love memorizing songs ;)

Users: fz
NEW VERSION OF TWITTER IS HERE ...
u got it? :0 RT @EN NEW VERSION OF TWITTER IS HERE

Users: fz
@fz YEAH I THINK SO ... YOU GOT IT ??? SPLIT SCREEN VERSION ?
@EN no :( m still waiting for it

Users: li, li
queremos unaa fotooooo deee ! @celeb1 y @celeb2
QUIERO UNA FOTO DE @celeb1 & @celeb2
@celeb1 duele tanto decir ALGO ?
@celeb2 nico porfi saca una foto con emi :(

@No [Hebrew characters]
@Li @Re [Hebrew characters]

@re twiitcam baby, yes no?!
@No yesssss, and my brother will be theirr !! hahah , your sweet
@Re jaja! very good sister! :)

Table 3: Representative examples of tweet exchanges
between the four pairs of users identified as being
among the top 5 highest transfer entropies for all
user pairs defined in Sec. 4.4. Edges were no → re,
no → li, en → fz, ta → sh. Users re, no, li tweet
primarily in Spanish, but all three occasionally address
each other and respond in Hebrew or English.



participants, and subtle language-based signals can be used
to infer power relationships among the users [6]. Similar to
our work, those approaches also work by projecting unstructured
user-generated text onto a multivariate time series,
in their case using LIWC categories [25] rather than
LDA-induced topics. However, the influence measures suggested
in [5, 6] are defined in a rather ad hoc manner, as opposed
to the more fundamental entropic measure used here. We
believe that our approach based on transfer entropy provides
a more principled measure of directed influence.

A crucial component of our approach is based on the abil- 
ity to estimate entropic quantities for very-high-dimensional 
random variables. Due to data sparsity, naive methods based 
on binning are not feasible. The binless approach for entropy 
estimation introduced in [17] has been used for quantifying 
information in neural spike trains [36] . The binless approach 
has been extended for estimating higher order entropic quan- 
tities such as mutual information [18] , divergences between 
two distributions [37], and transfer entropy [33]. We also
note that a linear version of the transfer entropy known as
Granger causality [10] has been used recently for uncovering
predictive causal relationships in neuroscience [16], genetics [21],
climate modeling [22] and various other applications.
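
To make the binless idea concrete, the following is a minimal sketch of a k-nearest-neighbor entropy estimate in the Kozachenko-Leonenko style. It assumes Euclidean distance, distinct sample points, and a brute-force neighbor search; it illustrates the general technique and is not necessarily the exact estimator or implementation used in our experiments. The function names are hypothetical.

```python
import math
import random

def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j<n} 1/j."""
    return -0.5772156649015329 + sum(1.0 / j for j in range(1, n))

def knn_entropy(points, k=3):
    """k-nearest-neighbor entropy estimate in nats.

    points: list of equal-length tuples of floats (assumed distinct,
    since a zero neighbor distance would make the logarithm undefined).
    """
    n, d = len(points), len(points[0])
    # Log volume of the d-dimensional Euclidean unit ball.
    log_unit_ball = (d / 2.0) * math.log(math.pi) - math.lgamma(d / 2.0 + 1.0)
    log_dist_sum = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        log_dist_sum += math.log(dists[k - 1])  # distance to the k-th neighbor
    return (digamma_int(n) - digamma_int(k) + log_unit_ball
            + d * log_dist_sum / n)

# Sanity check on a distribution with a known entropy.
random.seed(0)
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
estimate = knn_entropy(sample, k=3)
exact = math.log(2 * math.pi * math.e)  # 2-D standard normal, in nats
```

No binning is involved: the local density around each point is inferred from how far away its k-th neighbor lies, which is what makes the approach usable in high dimensions with sparse data.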

6. DISCUSSION 

We have seen that using content transfer as a general, sta- 
tistical measure of predictivity captures a wide variety of 
nontrivial behavior on Twitter. Information-theoretic tech- 
niques provide powerful, flexible tools for discovering pat- 
terns in data, but typically are impractical to implement. 
Surprisingly, non-parametric entropy estimation was quite 
effective on a dataset that would be considered small by 
recent research standards. This is despite the fine-grained
application of these entropic measures to individual user pairs.
Remarkably, Table 1 suggests the measure may even provide
a meaningful signal at the level of individual tweets.

The strongest, most predictive signals discovered in Sec. 4.3
were all characterized by some type of news dissemination.
Most interesting about these results was how many of the
links appeared to be purposely hidden in the explicit follower
graph. If news dissemination is for the purpose of promoting
your internet radio station, as in the fri ↔ aad example, it
may be advantageous for the accounts promoting your web
site to appear as independent as possible. Indeed, Twitter's
terms of usage prohibit automatic re-tweets, so if you are
copying content on multiple accounts, it would be a mistake
to call attention to the practice by using re-tweets. Ironically,
for these purposes it may be advantageous to hide the truly
influential edges, while at the same time it is advantageous
to accrue as many followers as possible to appear influential,
even if most of these followers are dummy accounts that are
not influenced at all.

We also found a statistically significant result in Sec. 4.4
for distinguishing "social influence", i.e., one user eliciting a 
response in another. The evaluation task we performed is 
akin to hearing hundreds of people talking at once and in- 
ferring who is talking to whom, just by the content of their 
statements and without reference to any explicitly declared 
relationships. While the data were not sufficient to distin- 
guish an arbitrary social tie, on average edges identified with 



mentions had a higher content transfer and this effect was 
over four standard deviations from the null hypothesis. One 
of the top examples corresponded to an intuitive notion of 
social influence, revolving around political discussion, but 
the strongest signals were for multi-lingual users. Responding
in kind to a certain language is a relatively easy signal
to identify, at least within a topic model representation. We
can see from Fig. 2 that to distinguish independent signals
from correlated ones with transfer entropy requires either a 
strong signal or more data. It would be interesting to see 
what types of social influence are detected with even an 
order of magnitude more data, which is still many orders 
of magnitude away from the amount of data regularly pro- 
cessed by companies like Twitter and Facebook. 

A subtle point about our measure of content transfer is that 
at no point are Y's tweets directly compared to X's tweets. 
Rather, the measure checks if X's future content varies in 
a predictable way based on Y's past content. While this
distinction may be overly general for the purpose of discov- 
ering connections from what topics are being discussed, it 
may be relevant for more subtle social cues. For instance, 
whenever Y makes an aggressive statement, if X always re- 
sponds submissively, this is a predictable, but not matching, 
response. To capture this type of scenario would require con- 
tent representation that includes things like stance, attitude, 
or sentiment, as discussed in Sec. 5.

Social media is in a state of constant growth and change. 
Subtle changes in the mechanisms that underlie social media
platforms can have dramatic effects on user behavior [13].
Results based on detailed modeling of hash tags,
mentions, or re-tweets may not be relevant for the next
generation of social media. On the other hand, a measure based
on information-theoretic principles will remain relevant for 
any communication medium. On a more practical note, by 
providing a model-free way to discover unexpected relation- 
ships in data, information-theoretic analysis is an effective 
tool for data exploration. 